Generalizability of machine learning in predicting antimicrobial resistance in E. coli: a multi-country case study in Africa

Background: Antimicrobial resistance (AMR) remains a significant global health threat, particularly impacting low- and middle-income countries (LMICs). These regions often grapple with limited healthcare resources and access to advanced diagnostic tools. Consequently, there is a pressing need for innovative approaches that can enhance AMR surveillance and management. Machine learning (ML), though underutilized in these settings, presents a promising avenue. This study leverages ML models trained on whole-genome sequencing data from England, where such data is more readily available, to predict AMR in E. coli, targeting key antibiotics such as ciprofloxacin, ampicillin, and cefotaxime. A crucial part of our work involved the validation of these models using an independent dataset from Africa, specifically from Uganda, Nigeria, and Tanzania, to ascertain their applicability and effectiveness in LMICs.
Results: Model performance varied across antibiotics. The Support Vector Machine excelled in predicting ciprofloxacin resistance (87% accuracy, F1 Score: 0.57), Light Gradient Boosting Machine for cefotaxime (92% accuracy, F1 Score: 0.42), and Gradient Boosting for ampicillin (58% accuracy, F1 Score: 0.66). In validation with data from Africa, Logistic Regression showed high accuracy for ampicillin (94%, F1 Score: 0.97), while Random Forest and Light Gradient Boosting Machine were effective for ciprofloxacin (50% accuracy, F1 Score: 0.56) and cefotaxime (45% accuracy, F1 Score: 0.54), respectively. Key mutations associated with AMR were identified for these antibiotics.
Conclusion: As the threat of AMR continues to rise, the successful application of these models, particularly on genomic datasets from LMICs, signals a promising avenue for improving AMR prediction to support large AMR surveillance programs. This work thus not only expands our current understanding of the genetic underpinnings of AMR but also provides a robust methodological framework that can guide future research and applications in the fight against AMR.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-024-10214-4.

Background
Antimicrobial resistance (AMR) is a pressing global health challenge that threatens human and animal wellbeing [1]. Recognized as a priority by the World Health Organization (WHO) and the United Nations General Assembly [2], AMR's unchecked proliferation could lead to catastrophic consequences, with Africa alone projected to account for millions of annual deaths by 2050 [3]. In 2019, reports showed that AMR all-age death rates were highest in some low- and middle-income countries (LMICs), making AMR not only a major health problem globally but a particularly serious problem for some of the poorest countries in the world [4].
The WHO launched the Global Antimicrobial Resistance and Use Surveillance System (GLASS) to enhance the AMR evidence base for priority pathogens including Escherichia coli, Klebsiella pneumoniae, Acinetobacter baumannii, Staphylococcus aureus, Streptococcus pneumoniae, Salmonella spp. and others. While capacities for antimicrobial susceptibility testing (AST) exist across Africa, they are unevenly distributed and often limited in scope, particularly in LMICs. The COVID-19 pandemic has, however, catalyzed the broader adoption of Next-Generation Sequencing (NGS) platforms in Africa, which are now increasingly available to support a range of disease surveillance programs, including AMR. This technological advance offers a valuable complement to traditional AST methods, although the distribution and accessibility of NGS capabilities remain variable across the continent.
In Uganda, available data indicates concerning levels of drug resistance among E. coli strains (45.62%) with substantial resistance to key antibiotics [5]. Similarly, in Tanzania and Nigeria, studies have highlighted the growing challenge of AMR, reflecting patterns of resistance that may differ from other regions, thereby necessitating localized surveillance and tailored predictive models [6]. These countries exemplify the diverse AMR landscape across Africa and underscore the need for enhanced detection methods and strengthening diagnostic programs [7][8][9].
Overall, the increasing availability of whole-genome sequence (WGS) data in dedicated databases, exemplified by tools like CARD and ResFinder, has facilitated the identification of antibiotic resistance determinants [10,11]. Existing approaches for detecting AMR from microbial whole-genome sequence data, such as rule-based models relying on identifying causal genes in databases, have high accuracy for some common pathogens but are limited in detecting resistance caused by unknown mechanisms in other major pathogenic strains. Machine learning techniques, including random forest, support vector machines, and neural networks, have shown great promise in predicting antimicrobial resistance [12]. These methods excel in capturing complex patterns within large datasets and can directly learn valuable features from genomic sequence data without relying on assumptions about the underlying mechanisms of AMR. Previous studies using machine learning have demonstrated success in predicting AMR and pathogen invasiveness from genomic sequences [13][14][15][16][17][18]. Despite this potential, the application of machine learning for AMR prediction has not been widely explored in LMICs, often due to data scarcity and the underrepresentation of AMR genetic determinants within reference databases [19].
To bridge this gap, we adopted a cross-continental approach, training machine learning models on data from England and validating them on datasets from Uganda, Tanzania and Nigeria. This strategy aimed to evaluate the efficacy of machine learning in predicting AMR for E. coli and assess the models' generalizability across diverse African settings and datasets. By leveraging microbial genomic data and advanced machine learning techniques, this study endeavored to enhance the accuracy and efficiency of AMR prediction, thus contributing significantly to the global battle against AMR. This comprehensive analysis provides crucial insights into the practical implementation and scalability of AMR prediction strategies, especially in LMICs where genomic data is limited and the burden of AMR is disproportionately high.

Study design
This was a cross-sectional study utilizing data collected in the past years to explore associations between predictors and outcomes.

Sample size
Two datasets of E. coli strains were used in this study, referred to as the Africa data and the England data.

Data description
The study focused on three antibiotics: ciprofloxacin (CIP), ampicillin (AMP), and cefotaxime (CTX). Each represents a different class of antibiotics (fluoroquinolones, penicillins, and cephalosporins, respectively). They are broad-spectrum antibiotics with activity against numerous Gram-positive and Gram-negative bacteria, including E. coli. These drugs were selected based on their increasing prevalence of resistance as reported in the GLASS report [5]. In addition, data on resistance to these drugs was available in the study datasets, making them an ideal choice for the study. The study utilized data from one of the largest complete E. coli datasets already available online from the National Center for Biotechnology Information, eliminating the need for additional data collection efforts. We categorised the data into two primary datasets: [...] in a study that was unravelling virulence determinants in extended-spectrum beta-lactamase-producing E. coli from East Africa using WGS [22]. The third dataset consisted of 68 samples collected from Nigeria as part of a study looking at WGS data from E. coli isolates from South-West Nigeria hospitals [23] (Table 1). Samples that had not been screened for AST were removed from the dataset.

Variant calling of whole-genome sequencing data
The raw WGS paired-end reads were first quality checked and filtered by fastp 0.23.4 using its default parameters: adapter detection and trimming, sliding window quality filtering with a threshold of Q20, end trimming for low-quality bases, and removal of reads shorter than 15 bp post-trimming [24]. The filtered reads were aligned to the E. coli K-12 substr. MG1655 U00096.3 complete genome using the Burrows-Wheeler Aligner-mem (0.7.17-r1188) algorithm with a default seed length of 19, bandwidth of 100, and off-diagonal X-dropoff of 100 [25]. BCFtools 1.18 was used for calling variants with a minimum depth of coverage of 10x and allelic frequency of 0.9 [26]. SAMtools 1.18 was used to sort the aligned reads and BCFtools 1.18 was used to filter the raw variants applying default filtering thresholds, including a minimum read depth of 2 and a SNP quality of 20 [27]. The entire bioinformatics workflow was subsequently executed on the Open Science Grid High Throughput Computing infrastructure [28,29].
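The depth and allele-frequency thresholds used for variant calling can be illustrated with a short filter. This is a sketch only, assuming each candidate variant has already been parsed into a total read depth and a count of reads supporting the alternate allele. The `Variant` record, the `passes_thresholds` function, and the three example sites are hypothetical and are not part of BCFtools or of the study data.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical, already-parsed variant record (not a BCFtools structure)."""
    position: int
    ref: str
    alt: str
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the alternate allele


def passes_thresholds(v, min_depth=10, min_alt_fraction=0.9):
    """Keep a variant only if coverage and alternate-allele fraction are high enough."""
    if v.depth == 0 or v.depth < min_depth:
        return False
    return v.alt_reads / v.depth >= min_alt_fraction


# Illustrative sites only; positions and counts are made up.
candidates = [
    Variant(1200, "A", "G", depth=40, alt_reads=39),  # kept: deep and 97.5% alt
    Variant(3400, "C", "T", depth=8, alt_reads=8),    # dropped: depth below 10x
    Variant(5600, "G", "A", depth=30, alt_reads=20),  # dropped: about 67% alt
]
kept = [v.position for v in candidates if passes_thresholds(v)]
```

Requiring a high alternate-allele fraction is a natural choice for a haploid organism such as E. coli, where a true variant should be supported by nearly all reads at the site.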

SNPs pre-processing and encoding
We employed a previously established methodology for constructing the SNP matrix from the VCF files. First, the reference alleles, variant alleles, and their positions were extracted from the VCF files and merged across isolates based on the position of the reference alleles. A SNP matrix was built in which the rows represented the samples and the columns represented the variant alleles [15]. The SNPs were then converted from characters to numbers using label encoding, a form of categorical encoding in which A, C, G, and T in the SNP matrix were mapped to 1, 2, 3, and 4 (Fig. 1).
It is acknowledged that certain machine learning models could misconstrue these as ordinal values; however, based on previous studies demonstrating minimal performance difference between label, one-hot and Frequency Chaos Game Representation encoding methodologies [15], label encoding was selected for its computational efficiency in handling large genomic datasets. Missing values, encoded as N, were converted to 0. Positions at which more than 90% of values were null were removed, and the remaining positions were used for machine learning. The antibiotic phenotypes were encoded as binary values: 'S' for susceptible was mapped to 0, and 'R' for resistant was mapped to 1.
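The encoding steps described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names and the toy isolates and positions are hypothetical, and only the mappings (A, C, G, T to 1, 2, 3, 4; N to 0; S/R to 0/1) and the 90% missingness cut-off come from the text.

```python
# Hypothetical sketch of the label-encoding step; names and toy data are illustrative.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4, "N": 0}  # N (missing) -> 0
PHENOTYPE_CODES = {"S": 0, "R": 1}  # susceptible -> 0, resistant -> 1


def encode_snp_matrix(rows):
    """Convert rows of base characters (one row per isolate) to integer codes."""
    return [[BASE_CODES[base] for base in row] for row in rows]


def drop_sparse_positions(matrix, positions, max_missing=0.9):
    """Remove columns whose fraction of missing calls (code 0) exceeds max_missing."""
    n_samples = len(matrix)
    keep = [
        j for j in range(len(positions))
        if sum(1 for row in matrix if row[j] == 0) / n_samples <= max_missing
    ]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [positions[j] for j in keep]


def encode_phenotypes(labels):
    """Map AST phenotype labels to binary targets."""
    return [PHENOTYPE_CODES[label] for label in labels]


# Four toy isolates at three made-up genome positions.
encoded = encode_snp_matrix(["ANG", "CNN", "GNN", "TNN"])
filtered, kept_positions = drop_sparse_positions(encoded, [101, 202, 303])
targets = encode_phenotypes(["S", "R", "R", "S"])
```

In the toy data the middle position is missing in every isolate and is dropped, while the last position (missing in three of four isolates) is retained because it stays under the 90% cut-off.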

Machine learning
We trained eight machine learning algorithms, each selected for its unique capabilities in predictive modeling.
The training of these models was conducted individually for each antibiotic, focusing on one antibiotic at a time to ensure the specificity and accuracy of the predictions. Logistic Regression (LR) provided a baseline for binary classification, and Random Forest (RF) and Gradient Boosting (GB) were chosen for their effectiveness in handling high-dimensional data and intricate relationships. Support Vector Machines (SVM) were implemented with a sigmoid kernel, optimized through hyperparameter tuning to a C parameter of 9.795846277645586 and gamma set to 'auto'. Feed-Forward Neural Networks (FFNNs), designed using Keras 2.12.0, consisted of an input layer with 64 neurons, a hidden layer with 32 neurons, and an output layer with one neuron, using binary cross-entropy loss and the Adam optimizer. The FFNN was trained for 20 epochs with a batch size of 32, with hyperparameter tuning improving its configuration. XGBoost (XGB) with xgboost 1.7.6, LightGBM (LGB) using lightgbm 4.1.0, and CatBoost using catboost 1.2.2 were implemented with default parameters, leveraging their efficiencies with large-scale data. All models were implemented using Scikit-learn version 1.3.2, except for FFNNs, which were implemented in Keras. Hyperparameter tuning was conducted for SVM and FFNNs using scikit-learn's RandomizedSearchCV, which helped identify the most effective configurations for these models. The training was performed on both originally imbalanced and balanced datasets. For balancing, a simple random down-sampling approach was employed to reduce the majority class, enabling us to assess the impact of class distribution on model performance.
This comprehensive approach, involving diverse algorithms and hyperparameter tuning, allowed for an exhaustive evaluation of predictive models in the detection of AMR, under varied dataset conditions.
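Random down-sampling of the majority class, as used for the balanced training sets, can be sketched as follows. This is an assumption-laden illustration: the function name `downsample_majority`, the fixed seed, and the choice to balance the two classes to exactly equal size are not stated in the text.

```python
import random


def downsample_majority(X, y, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    idx_by_class = {0: [], 1: []}
    for i, label in enumerate(y):
        idx_by_class[label].append(i)
    n_minority = min(len(indices) for indices in idx_by_class.values())
    kept = []
    for indices in idx_by_class.values():
        # Sampling without replacement keeps every minority sample
        # and a random subset of the majority class.
        kept.extend(rng.sample(indices, n_minority))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy example: 8 susceptible (0) and 2 resistant (1) isolates.
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2
X_bal, y_bal = downsample_majority(X_toy, y_toy)
```

Down-sampling discards data, so with a small minority class the balanced training set can become very small; this is one plausible reason balancing does not always improve performance.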

Statistical evaluation
The machine learning models were optimized using five repeats of 5-fold stratified cross-validation. For the final evaluation of the data from Africa, the performance was analyzed on the raw public dataset and on a balanced set using a down-sampling strategy. The models were evaluated using the receiver operating characteristic (ROC) curve and the area under the curve (AUC). Precision, recall, F1 score, and accuracy were calculated for all models. To determine the statistical significance of the differences in AUC scores between models, we employed Tukey's Honestly Significant Difference (HSD) test [30]. This test is appropriate for comparing all possible pairs of groups in a family of models without increasing the risk of Type I errors that multiple comparisons may induce. The significance threshold was set at α = 0.05, indicating that differences with p-values less than this threshold were considered statistically significant. The pairwise comparisons were conducted using statsmodels 0.14.0.
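The repeated stratified cross-validation scheme can be sketched without any library dependencies. The generator names and the round-robin assignment below are illustrative assumptions; scikit-learn provides the same scheme as `RepeatedStratifiedKFold`, although the text does not say which implementation was used.

```python
import random


def stratified_kfold(y, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(y)):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for k, i in enumerate(members):
            folds[k % n_splits].append(i)  # deal class members round-robin
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


def repeated_stratified_kfold(y, n_splits=5, n_repeats=5, seed=0):
    """Repeat stratified k-fold with a different shuffle on each repeat."""
    for r in range(n_repeats):
        yield from stratified_kfold(y, n_splits, seed + r)


# Toy labels: 20 susceptible and 10 resistant isolates give 25 train/test splits.
y_toy = [0] * 20 + [1] * 10
splits = list(repeated_stratified_kfold(y_toy))
```

Each of the 25 splits yields one AUC per model, and it is these per-split scores that a pairwise test such as Tukey's HSD would compare across models.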

Identification of genes
The method used to identify the top 10 most important features varied between models. For tree-based models like Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost, we utilized the feature_importances_ attribute, which quantifies the contribution of each feature to the model's prediction. In Logistic Regression, feature importance was deduced from the absolute values of the coefficients. The SVM model employed the coefficients' absolute values for linear kernels and SelectKBest with the chi2 method for non-linear kernels. For the Keras Neural Network model, we averaged the absolute values of the weights in the first layer, reflecting the relevance of each feature in the model. The corresponding gene annotations were extracted from the reference genome for the identified SNPs. The functional roles of these genes were then examined to investigate their potential contribution to antibiotic resistance mechanisms in E. coli (Fig. 2).
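The ranking step shared by these approaches can be sketched as below. The helper names and toy values are hypothetical; in practice the score vector would come from a fitted model, for example scikit-learn's `feature_importances_` or `coef_` attributes, or the first-layer weight matrix of the neural network.

```python
def top_k_features(scores, positions, k=10, use_abs=True):
    """Return the k (position, score) pairs with the largest (absolute) scores."""
    ranked = sorted(
        zip(positions, scores),
        key=lambda pair: abs(pair[1]) if use_abs else pair[1],
        reverse=True,
    )
    return ranked[:k]


def mean_abs_first_layer(weights):
    """Average |weight| per input feature; weights is n_features x n_units."""
    return [sum(abs(w) for w in row) / len(row) for row in weights]


# Toy coefficients for three made-up SNP positions.
top_two = top_k_features([0.2, -0.9, 0.5], [10, 20, 30], k=2)
# Toy first-layer weights: two input features, two units.
nn_scores = mean_abs_first_layer([[1.0, -1.0], [0.0, 0.5]])
```

Taking absolute values treats strongly negative and strongly positive coefficients as equally informative, which matches the description for Logistic Regression and linear SVMs.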

Performance of machine learning methods in predicting AMR
We assessed the performance of eight machine learning algorithms, including LR, RF, SVM, GB, XGB, LGB, CatBoost, and FFNN, in predicting antibiotic resistance in E. coli. Multiple metrics, such as accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic (ROC) curve, were used for evaluation (Table 2). The models were optimized using 5-fold stratified cross-validation and confidence intervals were recorded (Supplementary Material 1). Tukey's Honestly Significant Difference (HSD) test was employed for pairwise comparisons of AUC scores.
For CIP, we evaluated the models' effectiveness considering the class imbalance issue. We applied a random down-sampling strategy but did not observe significant improvements. The FFNN emerged as the top performer with the highest mean AUC score (0.83), while SVM achieved the highest accuracy (0.87). HSD tests revealed significant performance differences between several pairs of models, specifically between RF and all other models (p < 0.001).
For AMP, the SVM achieved the highest mean AUC score (0.72). GB had the highest F1 score and precision, and CatBoost and SVM had the highest recall scores.
For CTX, FFNN stood out with the highest mean AUC score (0.72), while SVM recorded the highest accuracy (0.92). The Random Forest model excelled in precision, and Logistic Regression had the highest F1 score (0.42) (Fig. 3).
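AUC and accuracy can rank models differently because AUC depends only on how well the predicted scores order resistant above susceptible isolates, not on a decision threshold. A minimal sketch of the rank-based definition follows; the function name and toy scores are illustrative and are not taken from the study.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Requires at least one sample of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc_toy = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ordering, so values only slightly above 0.5 indicate weak discrimination even when accuracy looks acceptable.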

Evaluation of the machine learning models on the Africa data
We assessed the generalizability of our machine learning models on an external dataset from Uganda, Nigeria and Tanzania, consisting of up to 170 samples with a severe class imbalance issue. Performance metrics for each model on this dataset are reported in Table 3.
In the external validation with the African dataset, the class imbalance presented varied challenges across different antibiotics. For CIP, the Logistic Regression model exhibited an accuracy of 0.55 and precision of 0.59, but a recall of only 0.16. The RF model achieved an accuracy of 0.50 and an AUC-ROC score of 0.53. SVM displayed an accuracy of 0.50, while GB showed an accuracy of 0.52 and a recall of 0.32. XGB had an accuracy of 0.57, and both LGB and CatBoost had accuracies just above 0.55, with CatBoost also attaining an AUC-ROC of 0.58. The FFNN model did not identify any true positives.
For AMP, LR achieved an accuracy of 0.94 and a near-perfect recall. RF had a precision of 0.93 but a lower accuracy of 0.38. SVM's performance was close to that of LR, with high accuracy and recall but a slightly lower AUC-ROC score of 0.57. GB, XGB, LGB, and CatBoost demonstrated solid accuracy and precision, albeit with varying AUC-ROC scores. The FFNN model's accuracy was 0.05.
Regarding CTX, LR recorded an AUC-ROC of 0.39, RF exhibited high precision but low recall, and SVM had a precision of 1. GB had the highest accuracy among the models at 0.22 and the highest AUC-ROC score of 0.57. XGB and LGB showed higher accuracy and recall rates, with LGB achieving the highest recall of 0.38. The FFNN model again showed zero capacity for true positive identification (Fig. 4).
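How accuracy, precision, recall, and F1 behave under severe class imbalance can be seen with a small worked example. The counts below are hypothetical (95 resistant and 5 susceptible isolates) and are not the study's data; the function name is likewise illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = resistant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical validation set: 95 resistant and 5 susceptible isolates.
y_true = [1] * 95 + [0] * 5
always_resistant = classification_metrics(y_true, [1] * 100)
always_susceptible = classification_metrics(y_true, [0] * 100)
```

Under this hypothetical imbalance, a classifier that labels every isolate resistant reaches 0.95 accuracy and an F1 of about 0.97 without discriminating at all, while one that labels every isolate susceptible scores 0.05 accuracy. This is why AUC-ROC is worth reading alongside accuracy and F1 when one class dominates.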

Marker genes associated with antibiotic resistance
A crucial part of machine learning in the genomic field is to interpret the model's results. In our case, the analysis of feature importance and interactions provided insights into which genetic mutations are most influential in predicting antibiotic resistance. For each model, we identified the top 10 features (SNP positions) with the highest importance scores, which reflect their contribution to the accuracy of the model's predictions.
For instance, in the Logistic Regression model on CIP, the mutation at position '3589009' has the highest importance score, followed by '4040529', '1473047', and so on. These positions potentially have a substantial impact on antibiotic resistance, as mutations in these regions could cause the bacteria to become resistant to specific antibiotics. The exact biological mechanism for this can be complex, involving changes in the gene's protein product that might render an antibiotic ineffective (Table 4).
Fig. 2 Flowchart showing how genes were identified
The models used different ways to calculate these importance scores, which is why they differ between models. Still, positions that are consistently high across different models can be a strong indicator of their significance in conferring antibiotic resistance.
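One simple way to look for positions that rank highly across models is to count how many top-10 lists each position appears in. The sketch below is illustrative: the function name, the `min_models` threshold, and the toy position lists are assumptions, not results from the study.

```python
from collections import Counter


def consensus_positions(top_by_model, min_models=2):
    """Return positions found in at least min_models top lists, most shared first."""
    counts = Counter(
        pos for positions in top_by_model.values() for pos in set(positions)
    )
    shared = [pos for pos, n in counts.items() if n >= min_models]
    return sorted(shared, key=lambda pos: (-counts[pos], pos))


# Toy top lists keyed by model name; positions are made up.
toy_top_lists = {
    "LR": [1, 2, 3],
    "RF": [2, 3, 4],
    "SVM": [3, 5],
}
shared_positions = consensus_positions(toy_top_lists)
```

Because importance scores are not on a common scale across models, counting membership in each model's top list avoids comparing raw scores directly.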

Gene annotation
The identification of SNPs associated with antibiotic resistance can shed light on the underlying genetic mechanisms that contribute to drug resistance in E. coli (Table 5). By analyzing the top SNPs from each predictive model, key marker genes that potentially play a role in antibiotic resistance were identified. For CIP, SNPs were identified in the following genes: rlmL, yehB, rrfA, yciQ, and ygjK. For AMP, the implicated genes include rcsD, yjfI, tdcE, ugpB, ugpQ, and ggt. Lastly, for CTX, SNPs were found in ydbA, mltB, lomR, mppA, recD, and glyS. The identified SNPs in these genes underscore the complex and multifactorial nature of antibiotic resistance in E. coli. A variety of biological processes, such as membrane transport, rRNA methylation, DNA repair, and cell wall synthesis, are potentially collectively implicated in the development of resistance. Further experimental validation of these marker genes is warranted to confirm their role in antibiotic resistance.

Discussion
This study explored the generalizability of machine learning models in predicting AMR in E. coli, utilizing datasets from England and multiple African countries. While the models showed promise on the England dataset, the application to the highly imbalanced African dataset illuminated significant challenges. The validation of machine learning models on the African dataset, which had a higher incidence of resistant strains compared to the training data from England, highlighted the challenges and potential of such tools; discrepancies in class distribution impacted performance measures like recall and precision, yet the robustness and real-world applicability of these models were affirmed when they successfully predicted resistance across varied datasets.
In the England dataset, models like SVM (Accuracy: 0.87, AUC-ROC: 0.86) and Logistic Regression (AUC-ROC: 0.77) demonstrated effectiveness. However, the transition to the African dataset, characterized by significant class imbalance, presented a stark contrast. For example, the Random Forest model experienced a decline in accuracy from 0.75 for CIP in the England dataset to 0.50 in the African dataset. The performance of the models on the African dataset, particularly in terms of recall, highlights potential overfitting to the England dataset and the need for more generalizable models. The disparity in class distribution between the datasets, where the England dataset had a higher proportion of susceptible strains and the African dataset had a higher proportion of resistant strains, presented both challenges and opportunities.
A notable observation in this study is the impressive performance of the models for predicting ampicillin (AMP) resistance in the African dataset, despite their moderate performance on the England dataset. For AMP, models demonstrated substantial accuracy and recall in the African dataset (e.g., Logistic Regression: Accuracy 0.94, Recall 0.99, F1 0.97), highlighting their effectiveness in identifying true resistance cases. This success may be attributed to the distinct resistance mechanisms of AMP, which were perhaps better captured in the training data, leading to more accurate predictions in the validation dataset, or the data representation of the AMP training dataset which might have contained patterns that were more representative of the resistance seen in the African dataset.
Moreover, the process of down-sampling the England dataset for training, while fostering a balanced environment, did not uniformly enhance model performance. While down-sampled models showed a slight improvement for AMP, indicating that down-sampling might enhance the model's sensitivity to specific resistance patterns associated with AMP, this effect was not as pronounced for CIP and CTX.
The identification of SNPs associated with antibiotic resistance can illuminate the genetic mechanisms driving drug resistance in E. coli. By analyzing the top SNPs from each predictive model, we identified key marker genes potentially involved in antibiotic resistance. For CIP, SNPs were identified in the following genes: ugpC, rlmL, yciQ, ygjK, yehB, rrfA, ytfB, and yjjW. These genes encode various bacterial functions. For instance, ugpC is part of the glycerol-3-phosphate (G3P) transport system implicated in phospholipid biosynthesis, and RlmL is an enzyme involved in the methylation of ribosomal RNA (rRNA). mdtC is a component of multidrug efflux pump systems that can contribute to antibiotic resistance by actively pumping out antibiotics from bacterial cells. It is important to note that while machine learning can highlight these genes as candidates, experimental validation is essential to confirm their roles in antibiotic resistance.

Implications and applications
While our research concentrated primarily on three specific antibiotics, the methodology we have developed is versatile and readily adaptable for investigating other antibiotics, and can be extended to resistance-associated SNPs in a variety of pathogens beyond bacteria. This flexibility allows for a broader scope of study, opening the door for a comprehensive understanding of AMR mechanisms. In addition, the applicability of our approach extends beyond the realm of infectious diseases, holding promise for other branches of biomedical research, such as predicting resistance to cancer treatments by enabling precise targeted therapy.

Limitations
While this study has provided valuable insights into predicting genotypic resistance to ciprofloxacin, ampicillin, and cefotaxime in E. coli strains, several limitations should be considered when interpreting the results. First, the study focused exclusively on SNPs as the single genomic factor. Antimicrobial resistance is a complex phenomenon influenced by various genomic drivers, including resistance genes, insertion sequences, plasmids and AMR gene cassettes, which collectively contribute to the intricate landscape of resistance mechanisms. Our study, by concentrating on SNPs, represents a deliberate simplification to ensure depth and clarity in our ML analysis, driven by data quality and the need for clinically interpretable models. However, we recognise that the exclusive emphasis on SNPs may not capture the entirety of the multifaceted interplay within resistance determinants. Furthermore, the validation of the models on the Africa data presented some challenges. The availability of whole-genome sequence data from Africa was limited, resulting in a relatively small dataset for model evaluation. Additionally, the African dataset exhibited high class imbalance, where certain resistance classes were significantly underrepresented. This imbalance can introduce bias and affect the performance metrics of the models. Due to our study's uniqueness, traditional benchmarking might not capture our nuanced challenges. Future studies should explore alternative methodologies for a comprehensive evaluation of predictive models in diverse contexts.
Moreover, it is important to highlight that the performance of the models in this study is specific to the context of the datasets used, which may not fully represent the diversity and complexity of AMR patterns observed in other regions or populations. Therefore, caution should be exercised when generalizing the findings to different settings. Despite these limitations, this study provides a valuable foundation for future research and highlights areas for improvement and expansion. Incorporating additional variables, addressing the class imbalance, and expanding the dataset to include a more diverse range of sequences would enhance the robustness and applicability of the models. Overall, while the findings of this study contribute to our understanding of genotypic resistance prediction, it is important to recognize these limitations.

Conclusion
In conclusion, our study highlights the complex interplay between data composition, model training approaches, and predictive accuracy in the context of AMR. The impressive performance of models for AMP in the African dataset despite their moderate performance in the England dataset underscores the potential of machine learning in AMR prediction, given appropriate training and validation strategies.
The findings from this study serve as a crucial reminder of the complexities involved in applying machine learning models to predict AMR across diverse settings. It emphasizes the importance of developing robust, adaptable, and generalizable machine learning tools, capable of handling varied data landscapes and resistance mechanisms. Future research should focus on integrating larger and more diverse datasets while exploring innovative methods to maintain a balance between dataset size and class distribution, thus advancing the development of machine learning tools in the global fight against antimicrobial resistance.
As the threat of antimicrobial resistance continues to rise, the successful application of these models, particularly on the African dataset, signals a promising avenue for improving AMR detection and treatment strategies. This work thus not only expands our current understanding of the genetic underpinnings of antibiotic resistance but also provides a robust methodological framework that can guide future research and applications in the fight against antimicrobial resistance.

Fig. 1 Illustration of the preprocessing and encoding process of the SNPs. Created with Biorender.com

Fig. 3 Performance of different machine learning methods for predicting AMR on England microbial sequence data

Fig. 4 Performance of different machine learning methods for predicting AMR on Africa microbial sequence data

Table 1
Overview of the data

Table 2
Performance of different machine learning methods for predicting AMR on England data

Table 3
Performance of different machine-learning methods for predicting AMR on Africa data

Table 4
Top 10 mutation positions with the most significant impact on the algorithms across all drugs